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Abstract 

In  this  paper  we  explore  the  performance  of  a  hybrid,  or  mixed-mode  (MPI-OpenMP),  parallel 
C++  implementation  versus  a  direct  MPI  implementation.  This  case-study  provides  sufficient 
amount  of  detail  so  it  can  be  used  as  a  point  of  departure  for  further  research  or  as  an  educational 
resource  for  additional  code  development  regarding  the  study  of  mixed-mode  versus  direct  MPI 
implementations.  The  hardware  test-bed  was  a  64-processor  cluster  featuring  16  multi-core 
nodes  with  four  cores  per  node.  The  algorithm  being  benchmarked  is  a  parallel  cyclic 
convolution  algorithm  with  no  inter-node  communication  that  tightly  matches  our  particular 
cluster  architecture.  In  this  particular  case-study  a  time-domain-based  cyclic  convolution 
algorithm  was  used  in  each  parallel  subsection.  Time  domain-based  implementations  are  slower 
than  frequency  domain-based  implementations,  but  give  the  exact  integer  result  when  performing 
very  large,  purely  integer,  cyclic  convolution.  This  is  important  in  certain  fields  where  the  round¬ 
off  errors  introduced  by  the  FFTs  are  not  acceptable.  A  scalability  study  was  carried  out  where 
we  varied  the  signal  length  for  two  different  sets  of  parallel  cores.  By  using  MPI  for  distributing 
the  data  to  the  nodes  and  then  using  OpenMP  to  distribute  data  among  the  cores  inside  each 
node,  we  can  match  the  architecture  of  our  algorithm  to  the  architecture  of  the  cluster.  Each  core 
processes  an  identical  program  with  different  data  using  a  single  program  multiple  data  (SPMD) 
approach.  All  pre  and  post-processing  tasks  were  performed  at  the  master  node.  We  found  that 
the  MPI  implementation  had  a  slightly  better  performance  than  the  hybrid,  MPI-OpenMP 
implementation.  We  established  that  the  speedup  increases  very  slowly,  as  the  signal  size 
increases,  in  favor  of  the  MPI-only  approach.  Even  though  the  mixed-mode  approach  matches 
the  target  architecture  there  is  an  advantage  for  the  MPI  approach.  This  is  consistent  with  what  is 
reported  in  the  literature  since  there  are  no  unbalancing  problems,  or  other  issues,  in  the  MPI 
portion  of  the  algorithm. 


Introduction 

In  certain  fields,  where  the  round-off  errors  introduced  by  the  FFTs  are  not  acceptable,  time 
domain-based  implementations  of  cyclic  convolution  guarantee  the  exact  integer  result  when 
performing  very  large,  purely  integer,  cyclic  convolution.  The  trade-off  is  that  these 
implementations  are  slower  than  frequency  domain-based  implementations.  Applications  that 
can  benefit  from  this  approach  include  multiplication  of  large  integers,  computational  number 
theory,  computer  algebra  and  others  ’  .  The  proposed,  time  domain-based,  parallel 
implementation  can  be  considered  as  complementary  to  other  techniques,  such  as  Nussbaumer 
convolution3  and  Number  Theoretic  Transforms4,  which  can  also  guarantee  the  exact  integer 
result  but  could  have  different  length-restrictions  on  the  sequences. 

Parallel  implementations  in  cluster  architectures  of  time  domain-based,  purely  integer,  cyclic 
convolution  of  large  sequences  resulted  much  faster  than  the  direct,  time  domain-based,  0(N2) 
implementation  in  a  single  processor.  This  is  not  the  case  for  frequency  domain-based 
implementations  where  the  direct  implementation  in  a  single  processor  is  usually  faster  than  the 
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parallel  formulation,  and  therefore  preferably,  unless  memory  limitations  or  round-off  errors 
become  an  issue  as  it  happens  with  the  mentioned  applications. 

The  algorithm  being  benchmarked  is  a  parallel  cyclic  convolution  algorithm  with  no 
interprocessor  communication.  We  selected  this  algorithm  because  it  strongly  matches  our 
particular  cluster  architecture  and,  at  the  same  time,  is  amenable  to  a  mixed-mode  (MPI- 
OpenMP)  implementation  as  well  as  to  a  direct  MPI  implementation.  In  the  past,  different 
variants  for  this  algorithm  were  developed5’6  and  its  use  within  different  hardware 
implementations  was  proposed  ’  ’  .  We  have  found  no  studies  regarding  the  implementation  of 
this  algorithm  in  cluster  architectures.  While  further  benchmarks  and  scalability  studies  will  be 
reported  elsewhere,  in  this  paper  we  are  focusing  in  a  MPI-only  versus  a  mixed-mode  (MPI- 
OpenMP)  parallel  implementation,  including  performance  and  scalability  studies,  carried  out  in 
our  16-node,  64  processor  cluster. 

Based  on  the  prime  factor  decomposition  of  the  signal  length  this  algorithm,  which  is  based  on  a 
block  diagonal  factorization  of  the  circulant  matrices,  breaks  a  one-dimensional  cyclic 
convolution  into  shorter  cyclic  sub-convolutions.  The  subsections  can  be  processed, 
independently,  either  in  serial  or  parallel  mode.  The  only  requirement  is  that  the  signal  length,  N, 
admits  at  least  an  integer,  r0>  as  a  factor;  N  =  r„.s.  The  Argawal-Cooley  Cyclic  Convolution 
algorithm,  has  a  similar  capability  but  requires  that  the  signal  length  can  be  factored  into 
mutually  prime  factors;  N  =  r0.s  with  (r0,s)  =  1.  Since  the  block  pseudocirculant  algorithm  is  not 
restricted  by  the  mutually  prime  constraint,  it  can  be  implemented  recursively  using  the  same 
factor6'7.  The  parallel  technique,  compounded  with  a  serial  recursive  approach  in  each  parallel 
subsection,  provides  a  sublinear  increase  in  performance  versus  the  serial-recursive 
implementation  in  a  single  core. 

For  our  scalability  studies  we  first  used  16  cores  at  four  cores  per  node.  We  accessed  the  16 
cores  directly  using  MPI  and  then,  for  the  hybrid  approach,  we  accessed  four  nodes  using  MPI 
followed  by  using  OpenMP  to  access  the  four  cores  in  each  node.  We  repeated  the  computation 
for  several  signal  lengths.  We  then  used  64  cores.  We  accessed  the  64  cores  directly  using  MPI 
and  then,  for  the  hybrid  approach,  we  accessed  16  nodes  using  MPI  followed  by  using  OpenMP 
to  access  the  four  cores  in  each  node.  Again,  several  signal  lengths  were  used.  At  each  parallel 
core  the  algorithm  was  run  in  a  serial-recursive  mode  until  the  recursion  became  more  expensive 
than  directly  computing  the  sub-convolution  for  the  attained  sub-length.  The  details  of  the  serial- 
recursive  implementation  will  be  reported  elsewhere. 

We  start  by  providing,  as  stated  in  the  literature6,  the  mathematical  framework  for  the  algorithm 
using  a  tensor  product  formulation.  We  then  develop  a  block  matrix  factorization  that  includes 
pre-processing,  post-processing  and  parallel  stages.  The  parallel  stage  is  defined  through  a  block 
diagonal  matrix  factorization6.  The  algorithm  block  diagram  is  clearly  developed  through  a  one- 
to-one  mapping  of  each  block  to  the  algorithm  block  matrix  formulation.  We  follow  by  stating 
the  benchmark  setup,  benchmark  results  and  conclusions. 

Algorithm  Description 

Here  we  briefly  describe  the  sectioned  algorithm  used  for  this  benchmark  as  reported  in  the 
literature5'6.  In  this  particular  implementation,  the  sub-convolutions  are  performed  using  a 
recursive  time  domain-based  cyclic  convolution  algorithm  in  order  to  avoid  round-off  errors.  The 
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proposed  algorithmic  implementation  does  not  require  communication  among  cores  but  involves 
initial  and  final  data  distribution  at  the  pre-processing  and  post-processing  stages. 

Cyclic  convolution  is  a  established  technique  broadly  used  in  signal  processing  applications. 
The  Discrete  Cosine  Transform  and  the  Discrete  Fourier  Transform,  for  example,  can  be 
formulated  and  implemented  as  cyclic  convolutions6'7.  In  particular,  parallel  processing  of  cyclic 
convolution  has  potential  advantages  in  terms  of  speed  and/or  access  to  extended  memory,  but 
requires  breaking  the  original  cyclic  convolution  into  independent  sub-convolution  sections.  The 
Agarwal-Cooley  cyclic  convolution  algorithm  is  suitable  to  this  task  but  requires  that  the 
convolution  length  be  factored  into  mutually  prime  factors,  thus  imposing  a  tight  constraint  on  its 
application  .  There  are  also  multidimensional  methods  but  they  may  require  the  lengths  along 
some  dimensions  to  be  doubled3.  There  other  recursive  methods,  which  are  also  constrained  in 

a 

terms  of  signal  length  .  When  the  mentioned  constraints  are  not  practical,  this  algorithm  provides 
a  complementary  alternative  since  it  only  requires  that  the  convolution  length  be  factorable5'6. 
This  technique,  depending  on  the  prime  factor  decomposition  of  the  signal  length,  can  be 
combined  with  the  Agarwal-Cooley  algorithm  or  with  other  techniques. 

By  conjugating  the  circulant  matrix  with  stride  permutations  a  block  pseudocirculant  formulation 
is  obtained.  Each  circular  sub-block  can  be  independently  processed  as  a  cyclic  sub-convolution. 
Recursion  can  be  applied,  in  either  a  parallel,  serial  or  combined  fashion,  by  using  the  same 
technique  in  each  cyclic  sub-convolution.  The  final  sub-convolutions  at  the  bottom  of  the 
recursion  can  be  performed  using  any  method.  The  basic  formulation  of  the  algorithm  as  stated 
in  the  literature  is  as  follows6, 

The  modulo-N  cyclic  convolution  of  the  length-N  sequences  x[n]  and  h[n]  is  defined  by, 


N-l 

(1) 

k=0 


y[n]  =  X  x[k]h[((n  -k))N] 


which  can  be  formulated  in  matrix  form  as, 


y  =  HNx 


(2) 


where  HN  is  the  circulant  matrix, 
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Given  N  =  r0.s,  and  conjugating  the  circulant  matrix,  HN,  with  stride-by-r0  permutations  we 
obtain, 


y  =  Hnx 

^N,rn  y  =  F*N,rnHNPN,r  PN,r  X 


^  prf)  —  Pn,HNPN,, 

The  decimated-by-r0  input  and  output  sequences  are  written  as, 

v  =  p  V,  xr  =  PMrx 

•N  AN,r0J»  r0  N,r0 


(4) 

(5) 

(6) 

(7) 


The  conjugated  circulant  matrix  has  the  form  of  a  Block  Pseudocirculant  Matrix 6,  represented  as 
Hpr0.  Block  Pseudocirculant  Matrices  have  the  circulant  sub-blocks  above  the  diagonal 
multiplied  by  a  cyclic  shift  operator.  Equation  (5)  can  be  re-written  using  (6)  and  (7)  as, 


yr  -  H  x 

J  ro  Pro  r0 

The  Block  Pseudocirculant  in  (8)  is  written  as5’6, 
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Sna-o  is  the  Cyclic  Shift  Operator  Matrix  defined  by, 
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All  sub-convolutions,  represented  by  the  sub-blocks  in  the  block  pseudcirculant  matrix,  can  be 
processed  in  parallel.  The  cyclic  shifts  above  the  diagonal  represent  a  circular  redistribution  of 
each  sub-convolution  result. 


Example:  For  N  =  4,  r0=  2,  HN=  H4,  (5)  becomes, 
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Where  P4.2  is  a  stride-by-2  permutation  matrix, 


P4,2  -  P4,2  _ 


0  0  0 
0  1  0 
1  0  0 
0  0  1 


Applying  (6)  and  (7),  (11)  becomes, 
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where  the  circulant  matrix  H4  in  (11)  has  become  blocked  in  a  pseudocirculant  fashion  and  can 
be  written  as, 
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where  So  is  the  cyclic  shift  operator  matrix, 
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and  the  blocked  cyclic  sub-convolutions  are  defined  as 
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Further  examples  can  be  found  in  the  literature6'7.  It  is  clear  that  each  sub-convolution  can  be 
separately  processed,  followed  by  a  reconstruction  stage  to  provide  the  final  result.  Parallel 
algorithms  can  be  developed  by  factorization  of  the  Block  Pseudocirculant  matrix  shown  in  (9), 
(12)  and  (13)  into  diagonal  blocks.  The  general  tensor  product  formulation  for  a  block  diagonal 
factorization  of  the  Block  Pseudocirculant  Matrix  is6, 


yr0  -  Rr„(Ar0  ®  1  N/r „ )D  H ro (B r„  ®  IN/r„)xr<. 


(15) 


where  xr0  and  yr„  are  the  decimated-by-r0  input  and  output  sequences  and  Ar0  and  Br„  are  the 
post/pre-processing  matrices,  which  are  determined  by  each  particular  factorization.  Matrix  Rr„ 
take  account  of  all  the  cyclic  shift  operators  and  Dh,u  is  the  diagonal  matrix  with  blocked 
subsections  that  are  suitable  to  be  implemented  in  parallel.  At  each  parallel  sub-section  the  same 
algorithm  can  be  applied  again.  The  general  form  of  these  matrices  for  different  variants  of  this 
algorithm  can  also  be  found  in  the  literature6.  We  are  using  the  simplest  approach,  which  is  the 
Fundamental  Diagonal  Factorization 6  that  implies  r02  independent  parallel  sub-convolutions  that 
uses  straight  forward  pre-processing  and  post-processing  stages.  This  formulation  is  the  best 
match  to  our  target  architecture  and  it  was  used  to  benchmark  the  mixed-mode,  MPI-OpenMP, 
implementation  versus  the  MPI  implementation. 

For  an  N-point  cyclic  convolution,  where  N  =  r0.s,  a  radix-2  (r0=  2)  fundamental  diagonalization 
gives  r02  =  4  diagonal  subsections.  Equations  (5)  and  (15)  with  r„=  2  can  be  written  as, 
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(17) 


The  block  diagram  realization  of  the  algorithm  embodied  in  (17)  is  shown  in  Fig.  1.  The  parallel 
realization  is  given  by  the  diagonal  matrix;  D,  while  the  pre-processing  distribution  is  given  by 
matrix  B  and  the  post-processing  cyclic  shifts  and  sums  are  given  by  matrices  R  and  A.  Note 
that,  following  (5),  (6)  and  (7),  Xo  and  Xi  are  the  decimated  subsections  yielded  by  the  stride 
permutation  applied  to  the  input  sequence.  We  also  need  to  apply  the  inverse  stride  permutation 
to  the  output  vector,  formed  by  Yo  and  Yi  in  order  to  reorder  the  output  and  give  the  final  result. 
In  the  present  implementation  the  parallel  cyclic  sub-convolutions  are  performed  at  the 
cores/nodes  and  the  pre-processing  and  post-processing  stages  are  computed  at  the  master  node. 
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The  implementation  in  each  core  is  time  domain-based  and  follows  a  serial  recursive  approach, 
to  be  discussed  elsewhere  using  a  variant  of  this  algorithm  termed  Basic  Diagonal  Factorization 
( BDF ) 6 


In/2 


Data  Shuffling  B  D  A  R  Ordered  Output 

Figure  1.  Parallel  cyclic  convolution  realization  based  on  a  Fundamental  Diagonal 
Factorization  (r0=  2)  of  the  Block  Pseudocirculant  Matrix.  Flow  graph  stages  B.  D,  A,  R 
directly  map  to  same  name  matrices  in  equations  (15).  (17). 


Hybrid,  MPI  -  OpenMP,  versus  a  Direct  MPI  Implementation 

MPI  is  capable  of  transmitting  the  data  to  the  nodes  but  it  is  not  necessarily  efficient  distributing 
the  data  to  the  cores  inside  the  multi-core  nodes.  OpenMP,  on  the  other  hand,  is  optimized  for 
shared  memory  architectures.  In  theory,  by  using  MPI  for  distributing  the  data  to  the  nodes,  and 
then  using  OpenMP  for  distributing  the  data  inside  the  multi-core  nodes,  we  can  match  the 
architecture  of  our  algorithm  to  the  architecture  of  the  cluster. 

Usually,  there  is  a  hierarchical  model  in  mixed-mode  implementations:  MPI  parallelization 
occurs  at  the  top  level  and  OpenMP  parallelization  occurs  at  the  low  level,  as  it  is  shown  in 
Figure  2.  This  model10  matches  the  architecture  of  our  cluster.  For  example,  codes  that  do  not 
scale  well  as  the  MPI  processes  increase,  but  scale  well  as  OpenMP  processes  increase,  may  take 
advantage  of  using  mixed-mode  programming.  Nonetheless,  previous  studies  show  that  the 
mixed-mode  approach  is  not  necessarily  the  most  successful  technique  on  SMP  systems  and 
should  not  be  considered  as  the  standard  model  for  all  codes10. 

If  the  parallel  subsections  shown  in  Figure  1  are  cores,  the  cyclic  sub-convolutions  are  directly 
computed  at  each  core.  If  the  parallel  subsections  are  multi-core  nodes,  the  algorithm  can  be 
applied  again  in  a  parallel-recursive  fashion  at  each  node  to  redistribute  data  among  the  cores 
within  the  node.  The  cyclic  sub-convolutions  are  again  computed  directly  at  each  core.  The 
former  scheme  lends  itself  to  a  direct  MPI  implementation  while  the  latter  scheme  is  amenable  to 
a  hybrid,  MPI-OpenMP  implementation  as  shown  in  Figure  2.  Note  that,  in  all  cases,  the  final 
sub-convolutions  at  the  cores  can  be  performed  through  a  serial-recursive  approach  using  the 
BDF  variant6  of  this  algorithm.  The  serial-recursive  process  stops  at  some  recursive  step  when 
continuing  the  recursion  becomes  more  expensive  than  directly  computing  the  sub-convolution. 
The  techniques  used  to  tackle  the  serial-recursive  approach  will  be  reported  elsewhere. 
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Figure  2.  Example  of  a  Data  Distribution  Model  for  a  Mixed-Mode  Implementation10 
16-Processor  Direct  MPI  Implementation 

For  the  direct  MPI  implementation  we  partitioned  the  cyclic  convolution  into  16  parallel 
subsections  of  one-fourth  the  original  convolution  length  using  a  radix -4  approach  (r0=  4).  This 
implementation  was  tackled  using  MPI  where  the  data  was  distributed  from  the  master  node  to 
16  parallel  cores.  In  each  core  a  serial  approach  was  used  by  recursive  application  of  the  same 
algorithm  until  the  final  cyclic  sub-convolutions  were  performed  using  a  straightforward,  time 
domain-based,  convolution  algorithm.  We  repeated  the  overall  process  for  eight  different  signal 
lengths  (2  to  2  ).  The  block  diagram  realization  is  shown  in  Figure  3. 

16-Processor  Hybrid,  MPI-OpenMP  Implementation 

We  then  implemented  a  hybrid,  MPI-OpenMP.  approach.  Using  this  technique,  data  of  one-half 
the  original  length  was  distributed  from  the  master  node  to  four  parallel  nodes  using  MPI 
following  the  architecture  in  Figure  1  (Radix-2,  r0  =  2).  Then,  under  OpenMP,  the  structure  in 
Figure  1  was  applied  again  (Radix-2,  r0  =  2)  in  a  parallel-recursive  fashion  at  each  of  the  four 
nodes.  The  final  processing  length  is  one-fourth  of  the  original  signal  length  and  both  methods 
use  16  processors.  In  each  core,  as  in  the  previous  case,  a  serial-recursive  approach  is  used  to 
compute  the  final  sub-convolutions.  We  repeated  the  overall  process  for  eight  different  signal 
lengths  (215  to  222).  The  block  diagram  realization  is  shown  Figure  4. 

64-Processors  MPI  versus  Hybrid,  MPI-OpenMP,  Implementations 

The  procedure  described  in  the  past  two  paragraphs  was  repeated  for  64  processors  using  MPI 
for  the  direct  implementation  (radix-8,  r0  =  8).  The  hybrid  implementation  used  MPI  to  distribute 
data  among  16  multi-core  nodes  (radix-4,  r0=  4)  followed  by  OpenMP  to  distribute  data  among 
the  four  processors  within  each  node  (radix-2,  r0=  2).  In  both  approaches  the  data  length,  after 
parallel  distribution,  was  one-eighth  of  the  original  signal  length.  As  with  the  16-processor 
implementation  the  parallel  sub-convolutions  in  each  core  were  computed  using  a  serial- 
recursive  approach.  The  block  diagram  for  the  direct  implementation  is  similar  to  the  one  shown 
in  Figure  3,  but  now  distributing  data  to  64  cores  using  MPI.  The  block  diagram  for  the  hybrid 
implementation  is  similar  to  the  one  shown  in  Figure  4,  but  now  we  are  distributing  data  to  16 
multi-core  nodes  using  MPI  followed  by  further  data  distribution  to  the  four  cores  within  each 
node  using  OpenMP.  We  repeated  both  processes  for  10  different  signal-lengths  (215  to  224). 
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In/4 


(Master  Node)  (Master  Node) 


Figure  3.  MPI  Implementation  using  Radix-4  (16-Core).  Stages  B,  D,  A,  R  directly  map  to  the 
same  name  matrices  in  equations  ( 15)  and  (17).  Note  that  no  node  intercommunication  occurs. 
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Figure  4.  Mixed-Mode,  MPI-OpenMP  Implementation  Recursively  using  Radix-2  (16-Cores). 
Stages  B.  D,  A,  R  in  the  signal  flow  graph  map  to  matrices  B,  D,  A.  R  in  equations  (15)  and  (17). 
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Hardware  Setup 

Hardware  Setup:  16-Nodes,  64  Processors,  Dell  Cluster 

The  current  benchmark  was  carried  out  using  a  16-node  cluster  with  multi-core  nodes.  Each  node 
has  two,  dual-core  processors  for  a  total  of  64  cores.  The  cluster  is  somewhat  dated  but  still 
useful  for  comparative  and  scalability  studies.  The  cluster  hardware  specifications  are: 

Master  Node: 

•  Brand:  Dell  PowerEdge  1950 

•  Processor:  2  X  Intel  Xenon  Dempsey  5050,  80555K  @  3.0  GHz  (Dual-Core 
Processor).  667MHz  FSB.  LI  Cache  Data:  2  x  16KB,  L2  Cache:  2  x  2MB. 

•  Main  Memory:  2GB  DDR2-533,  533MHz. 

•  Secondary  Memory:  2  X  36GB  @  15K  RPM. 

•  Operating  System:  CentOS  Linux  5.6  64-bit. 

Compute  Nodes: 

•  Brand:  Dell  PowerEdge  SC  1435 

•  Processor:  2  X  AMD  2210  @  1.8GHz  (Dual-Core  Processor).  LI  Cache  Data:  2  x 
128KB,  L2  Cache:  2  x  1MB. 

•  Main  Memory:  4GB  DDR2-667,  667MHz. 

•  Secondary  Memory:  80GB  @  7.2K  RPM. 

•  Operating  System:  Linux  64-bit. 

Software:  C++ 


Benchmark  Results 

a)  Direct  MPI  versus  Hybrid,  MPI-OpenMP,  Implementations:  4-Nodes,  16-Cores 


Table  1.  Speedup  of  direct  MPI  over  the  Hybrid  Mode  for  16  Cores.  Signal  Length  2M. 


M 

Hybrid 

Mode 

MPI 

Mode 

Speed  Up 

15 

0.1325 

0.1225 

1.081632653 

16 

0.555 

0.3375 

1.644444444 

17 

1.0425 

1.0575 

0.985815603 

18 

3.59 

2.775 

1.293693694 

19 

9.1125 

8.03 

1.134806974 

20 

25.81 

22.1925 

1.16300552 

21 

75.305 

64.1225 

1.174392764 

22 

223.218 

188.06 

1.186950973 
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Speedup  of  MPI  over  the  Hybrid  Mode  vs  Signal  Size  (16-Core) 


Exponent  M  (Signal  Length:  2M) 


-Speedup 


Figure  5.  Speedup  of  Direct  MPI  over  the  Hybrid  Mode  for  16  Cores.  Signal  Length  2M. 


b)  Direct  MPI  versus  Hybrid,  MPI-OpenMP,  Implementations:  16-Nodes,  64-Cores 


Table  2.  Speedup  of  Direct  MPI  over  the  Hybrid  Mode  for  64  Cores,  Signal  Length  2M. 


M 

Hybrid 

Mode 

(Seconds) 

MPI 

Mode 

(Seconds) 

Speedup 

15 

0.1 1 

0.25 

0.44 

16 

0.305 

0.2575 

1.184466019 

17 

0.7775 

0.665 

1.169172932 

18 

1.97 

1.545 

1 .275080906 

19 

4.2825 

3.835 

1.116688396 

20 

10.58 

9.7975 

1.079867313 

21 

28.69 

26.4 

1 .086742424 

22 

80.1725 

72.3175 

1.108618246 

23 

233.21 

203.01 

1.148761 145 

24 

736.34 

597.607 

1.232147548 
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Speedup  of  MPI  over  the  Hybrid  Mode  vs  Signal  Size  (64-Core) 
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Figure  6.  Speedup  of  Direct  MPI  over  the  Hybrid  Mode  for  64  Cores.  Signal  Length  2M. 


We  can  see  that  as  the  signal  length  increases  the  direct  MPI  implementation  becomes  faster  than 
the  mixed-mode  implementation.  Figure  5,  6,  Table  1,  2.  The  fact  that  the  hybrid  implementation 
is  slower  than  the  MPI  implementation  appears  to  be  mainly  because  of  the  compute  time  at  the 
sub-convolution  stage  within  the  nodes.  Considering  that  the  absolute  time  is  slightly  larger  for 
the  mixed-mode  approach,  and  looking  to  Figures  7,  8,  it  can  be  concluded  that  the  transfer  times 
are  similar  for  both  methods.  There  could  be  some  particularity  in  the  OpenMP  portion  of  the 
implementation  that  is  causing  a  relative  slowdown  in  the  compute  time  for  the  mixed-mode 
approach.  If  such  a  problem  could  be  detected,  further  code  optimization  may  be  needed  if  we 
were  to  improve  the  performance  of  the  hybrid  implementation  versus  the  MPI-only  approach. 


c)  Transfer  Time  Studies 

Figure  7  shows  the  transfer  times  for  the  direct  MPI  implementation.  The  transfer  time 
percentage  is  slightly  less  for  the  hybrid  MPI-OpenMP  parallel  implementation  shown  in  Figure 
8.  The  percentage  of  transfer  time,  with  respect  to  the  total  time,  increases  as  we  increase  the 
number  of  processors.  This  is  because  the  total  time  for  a  fixed  signal  size  decreases  as  more 
parallel  processors  are  added.  For  a  fixed  number  of  processors  as  the  signal  length  increases  the 
percentage  of  transfer  time,  compared  to  the  total  execution  time,  decreases.  Note  that  the  mixed¬ 
mode  approach  needs  a  minimum  of  16  parallel  processors  using  the  proposed  formulation.  The 
four-processor  MPI  implementation  is  shown  just  for  reference  since  we  are  comparing  the  16 
and  64-processor  implementations  using  the  MPI  approach  versus  the  mixed-mode  approach. 


Transfer  Time  Percentage  of  Total  Time  Transfer  Time  Percentage  of  Total  Time 
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Transfer  Time  Percentage  of  Total  Execution  Time  versus  Number 
of  Processors  (MPI  Implementation) 


♦  Signal  Length  =  2A20 
M  Signal  Length  =  2A21 
— * — Signal  Length  =  2A22 


Transfer  Time  Percentage  of  Total  Execution  Time  versus  Number  of  Processors  (4, 
16,  64)  for  the  MPI  Implementation. 


Transfer  Time  Percentage  of  Total  Execution  Time  versus  Number 
of  Processors  (Mixed-Mode  Implementation) 


♦  Signal  Length  =  2A20 
■  Signal  Length  =  2A21 
— * — Signal  Length  =  2A22 


Figure  8.  Transfer  Time  Percentage  of  Total  Execution  Time  versus  Number  of  Processors  (16, 

64)  for  the  MPI-OpenMP  Hybrid  Implementation 
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d)  Absolute  Speedup  of  a  Parallel  Implementation  versus  a  One-Core  Serial  Implementation. 

For  completeness  we  provide  the  speedup  of  the  MPI-only  parallel-recursive  implementation 
versus  the  direct  serial -recursive  implementation.  Fig.  9,  Fig.  10.  Since  the  parallel  technique 
proposed  in  this  paper  uses  a  serial-recursive  approach  to  perform  each  parallel  sub-convolutions 
it  can  be  benchmarked  against  the  performance  of  a  serial-recursive  implementation  in  a  single 
core.  This  benchmark  shows  that,  as  the  signal  length  increases,  the  use  of  additional  parallel 
processors  increases  the  speedup  of  the  parallel  approach  versus  the  serial  approach.  The 
increase  in  performance  provided  by  this  parallel  scheme  is  sublinear. 


Parallel-Recursive  Implementation  (16  cores,  r0=4)  using  MPI 
versus  a  One-Core  Serial-Recursive  Implementation 
Lengths:  215  to  222 


a 

o 

•a 

a> 

<v 

a 


Parallel-Recursive  16-Core 


Figure  9.  16-Core  Parallel-Recursive  versus  a  One-Core  Serial-Recursive  Cyclic  Convolution 

Implementation  Speedup  using  C++ 


Parallel-Recursive  Implementation  (64  Cores,  r0=8)  using 
MPI  versus  a  One-Core  Serial-Recursive  Implementation 
Lengths:  215  to  222 


Parallel-Recursive 

64-Core 


Figure  10.  64-Core  Parallel-Recursive  versus  a  One-Core  Serial-Recursive  Cyclic  Convolution 

Implementation  Speedup  using  C++ 
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Educational  Component 

The  reported  case-study  represents  an  example  of  a  complex  algorithm  that  can  be  reformulated 
in  order  to  more  closely  conform  to  the  underlying  target  architecture.  Since  case-studies  provide 
a  focused  framework  for  the  application  of  engineering  knowledge11,  we  have  provided 
sufficient  amount  of  detail  so  it  can  be  used  as  a  point  of  departure  for  further  research.  It  can 
also  be  used  as  an  educational  resource  in  algorithm  manipulation,  and/or  further  code 
development,  regarding  the  choice  of  mixed-mode  (MPI-OpenMP)  versus  direct  MPI 
implementations . 


Future  Work 

As  a  future  work  we  plan  to  benchmark  for  memory  efficiency,  where  the  mixed-mode  approach 
could  have  an  advantage,  and  to  advance  our  code  optimization  efforts  seeking  an  increased 
performance  for  both  techniques. 


Conclusions 

By  using  MPI  for  distributing  data  to  the  nodes  and  then  using  OpenMP  to  distribute  data  among 
the  cores  inside  each  node,  we  matched  the  architecture  of  our  algorithm  to  our  target  cluster 
architecture.  Each  core  processes  an  identical  program  with  different  data  using  a  single  program 
multiple  data  (SPMD)  approach.  All  pre  and  post-processing  tasks  were  performed  at  the  master 
node. 

We  found  that  the  MPI  implementation  had  a  slightly  better  performance  than  the  mixed-mode, 
MPI-OpenMP  implementation.  We  established  that  the  speedup  increases  very  slowly,  as  the 
signal  size  increases,  in  favor  of  the  MPI-only  approach.  Looking  at  Figure  3  and  4  it  becomes 
clear  that  the  MPI  portion  of  both  implementations  should  not  suffer  from  load  unbalancing 
problems. 

The  transfer  time,  as  percentage  of  total  time,  is  slightly  less  for  the  mixed-mode  parallel 
implementation  than  for  the  MPI  implementation.  Since  the  total  time  for  the  hybrid  approach  is 
slightly  higher,  the  transfer  times  for  both  approaches  are  basically  the  same.  Given  that  the 
transfer  times  are  similar,  there  could  be  some  particularity  in  the  OpenMP  portion  of  the 
implementation  that  is  causing  a  relative  slowdown  in  the  compute  time  versus  the  MPI 
approach.  If  such  a  problem  can  be  detected  further  code  optimization  may  be  needed  if  we  were 
to  improve  the  performance  of  the  mixed-mode  implementation  versus  the  MPI  implementation. 

When  there  are  no  unbalancing  problems  or  other  issues  in  the  MPI  portion  of  the  algorithm  the 
direct  MPI  option  may  be  adequate10.  This  could  account  for  the  slightly  less  performance 
exhibited  by  the  mixed-mode  approach  despite  its  tight  match  to  the  target  architecture.  This 
underscores  the  fact,  as  stated  in  the  literature10,  that  the  mixed-mode  approach  is  not  necessarily 
the  most  successful  technique  on  SMP  systems  and  should  not  be  considered  as  the  standard 
model  for  all  codes. 
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As  expected,  when  the  signal  length  increases  the  use  of  additional  parallel  processors 
sublinearly  increases  the  performance  of  the  parallel-recursive  approach  versus  the  serial- 
recursive  implementation  in  a  single  core. 

The  presented  case-study  can  be  used  as  point  of  departure  for  further  research  or  can  also  be 
used  as  an  educational  resource  in  algorithm  manipulation,  and/or  additional  code  development, 
regarding  the  choice  of  mixed-mode  (MPI-OpenMP)  versus  direct  MPI  implementations.  The 
authors  will  gladly  provide  supplementary  information  or  resources. 
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